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Abstract 

We  introduce  an  ensemble  learning  method  for  temporal  data  that  uses  a  mixture  of  hidden  Markov  models  (HMM). 
We  hypothesize  that  the  data  are  generated  by  K  models,  each  of  which  reflects  a  particular  trend  in  the  data.  The 
proposed  approach,  called  ensemble  HMM  (eHMM),  is  based  on  clustering  within  the  log-likelihood  space  and  has 
two  main  steps.  First,  one  HMM  is  fit  to  each  of  the  N  individual  training  sequences.  For  each  fitted  model,  we  evaluate 
the  log-likelihood  of  each  sequence.  This  results  in  an  N-by-N  log-likelihood  distance  matrix  that  will  be  partitioned 
into  K  groups  using  a  relational  clustering  algorithm.  In  the  second  step,  we  learn  the  parameters  of  one  HMM  per 
cluster.  We  propose  using  and  optimizing  various  training  approaches  for  the  different  K  groups  depending  on  their 
size  and  homogeneity.  In  particular,  we  investigate  the  maximum  likelihood  (ML),  the  minimum  classification  error 
(MCE),  and  the  variational  Bayesian  (VB)  training  approaches.  Finally,  to  test  a  new  sequence,  its  likelihood  is  computed 
in  all  the  models  and  a  final  confidence  value  is  assigned  by  combining  the  models'  outputs  using  an  artificial  neural 
network.  We  propose  both  discrete  and  continuous  versions  of  the  eHMM. 

Our  approach  was  evaluated  on  a  real-world  application  for  landmine  detection  using  ground-penetrating  radar 
(GPR).  Results  show  that  both  the  continuous  and  discrete  eHMM  can  identify  meaningful  and  coherent  HMM  mixture 
components  that  describe  different  properties  of  the  data.  Each  HMM  mixture  component  models  a  group  of  data 
that  share  common  attributes.  These  attributes  are  reflected  in  the  mixture  model's  parameters.  The  results  indicate 
that  the  proposed  method  outperforms  the  baseline  HMM  that  uses  one  model  for  each  class  in  the  data. 

Keywords:  Hidden  Markov  models;  Mixture  models;  Landmine  detection;  Ground-penetrating  radar;  Clustering 


1  Introduction 

Detection  and  removal  of  buried  landmines  is  a  worldwide 
humanitarian  and  military  problem.  The  latest  statistics 
[1]  show  that  in  2012,  a  total  of  3618  casualties  from  mines 
were  recorded  in  62  countries,  the  vast  majority  (78  %)  of 
casualties  were  civilians.  Detection  and  removal  of  land¬ 
mines  is  therefore  a  significant  problem  and  in  recent 
years  has  attracted  several  researchers.  One  challenge  in 
landmine  detection  lies  in  plastic  or  low  metal  mines 
that  are  difficult  to  detect  by  traditional  metal  detectors. 
Varieties  of  sensors  have  been  proposed  or  are  under 
investigation  for  landmine  detection.  Ground-penetrating 
radar  (GPR)  offers  the  promise  of  detecting  landmines 
with  little  or  no  metal  content.  Unfortunately,  landmine 
detection  via  GPR  has  proven  to  be  a  difficult  problem 
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[2,  3].  Although  systems  can  achieve  high  detection  rates, 
they  have  done  so  at  the  expense  of  high  false  alarm  rates. 
The  key  challenge  to  mine  detection  technology  lies  in 
achieving  a  high  rate  of  mine  detection  while  maintaining 
a  low  false  alarm  rate.  The  performance  of  a  mine  detec¬ 
tion  system  is  therefore  commonly  measured  by  a  receiver 
operating  characteristics  (ROC)  curve  that  specifies  the 
rate  of  true  detection  versus  the  rate  of  false  alarm. 

To  improve  the  overall  ROC  of  the  landmine  detection 
system,  several  algorithms  have  been  introduced  in  the 
last  decade.  These  algorithms  use  methods  such  as  fuzzy 
logic  [4],  hidden  Markov  models  [5-7],  nearest  neighbor 
classifiers  [8,  9],  support  vector  machines  [10],  or  random 
forest  [11]  to  assign  a  confidence  that  a  mine  is  present  at 
a  point. 

In  [5,  6],  hidden  Markov  modeling  was  proposed  for 
detecting  both  metal  and  nonmetal  mine  types  using 
data  collected  by  a  moving  vehicle-mounted  GPR  sys¬ 
tem.  These  initial  applications  have  proved  that  HMM 
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techniques  are  feasible  and  effective  for  landmine  detec¬ 
tion.  The  initial  work  relied  on  simple  gradient  edge 
features.  Subsequent  work  used  an  edge  histogram 
descriptor  (EHD)  approach  to  extract  features  from  the 
original  GPR  signatures.  The  baseline  HMM  classifier 
consists  of  two  HMM  models,  one  for  mine  and  one  for 
background.  The  mine  (background)  model  captures  the 
characteristics  of  the  mine  (background)  signatures.  The 
model  initialization  and  subsequent  training  are  based  on 
global  averaging  over  the  training  data  corresponding  to 
each  class. 

Most  subsequent  published  works  in  the  area  of  land¬ 
mine  detection  using  HMMs  focused  on  feature-level 
fusion  [12]  and/or  model-level  fusion  [13-15].  All  of  these 
methods  still  use  a  single  model  for  each  class.  In  this 
paper,  we  argue  that  a  single  model  is  not  sufficient  to 
capture  the  intra-class  variability.  In  the  context  of  land¬ 
mine  detection,  variations  in  the  class  of  mines  may  be 
caused  by  the  different  mine  types,  burial  depth,  soil 
type,  and  moisture.  Similarly,  background  signatures  may 
exhibit  large  variations  due  to  different  soil  conditions  and 
data  preprocessing  techniques.  To  generalize  the  HMM 
approach,  we  identify  the  variations  within  each  class  in  an 
unsupervised  manner  and  use  multiple  models  to  account 
for  the  intra-class  variations. 

The  proposed  approach  consists  of  the  construction  of 
a  mixture  of  HMMs  to  cover  the  diversity  of  the  training 
data.  This  approach,  called  ensemble  of  hidden  Markov 
models  (eHMM),  has  four  main  components:  similarity 
matrix  computation,  relational  clustering,  adaptive  train¬ 
ing  scheme,  and  decision  level  fusion.  These  components 
are  summarized  by  the  block  diagram  in  Fig.  1  and  will  be 
described  in  section  4. 

The  remainder  of  this  paper  is  organized  as  fol¬ 
lows.  Section  2  provides  background  material  on  hidden 
Markov  models.  Section  3  highlights  the  motivations  for 
adopting  multiple  models  in  our  approach.  Section  4  out¬ 
lines  the  eHMM  architecture  and  describes  its  different 
components.  Section  5  reports  the  experimental  results  of 
our  eHMM  approach  on  large  GPR  collections  and  com¬ 
pare  them  to  those  of  the  baseline  HMM  detector.  Finally, 
conclusions  are  provided  in  Section  6. 

2  Background 

2.1  Hidden  Markov  models 

An  HMM  is  a  model  of  a  doubly  stochastic  process  that 
produces  a  sequence  of  random  observation  vectors  at 
discrete  times  according  to  an  underlying  Markov  chain. 
At  each  observation  time,  the  Markov  chain  may  be  in  one 
of  N  states  {si, . . . ,  sm}  and,  given  that  the  chain  is  in  a  cer¬ 
tain  state,  there  are  probabilities  of  moving  to  other  states. 
These  probabilities  are  called  transition  probabilities.  Fet 
T  be  the  length  of  the  observation  sequence  (i.e.,  number 
of  time  steps),  let  O  =  {Oi, . . . ,  Oj}  be  the  observation 


sequence,  and  let  Q  =  {q\, . . . ,  qj}  be  the  state  sequence. 
The  compact  notation 

A.  =  (A,B,tt)  (1) 

is  generally  used  to  indicate  the  complete  parameter  set  of 
the  HMM  model.  In  (1),  A  =  [  a^}  is  the  state  transition 
probability  matrix,  where  atj  =  Pr(qt  =  j\qt-i  =  0  for 
i,j  =  1,  7t  =[i Ti\,  where  7T;  =  Pr(q\  =  si)  are 

the  initial  state  probabilities;  and  B  =  bi(Ot),  i  =  1, . . . ,  N, 
where  bi(Ot)  =  Pr(Ot\qt  =  0  is  the  observation 
probability  distribution  in  state  i. 

An  HMM  is  called  continuous  if  the  observation  prob¬ 
ability  density  functions  are  continuous  and  discrete  oth¬ 
erwise.  In  the  case  of  the  discrete  HMM,  the  observation 
vectors  are  commonly  quantized  into  a  finite  set  of  sym¬ 
bols,  {vi,V2,...,VAf},  called  the  codebook.  Each  state  is 
represented  by  a  discrete  probability  density  function  and 
each  symbol  has  an  associated  probability  of  occurring 
given  that  the  system  is  in  a  given  state.  In  other  words,  B 
becomes  a  simple  set  of  fixed  probabilities  for  each  state. 
That  is,  bi(Ot)  =  bi(1<)  =  Privj^qt  =  i),  where  is  the 
nearest  codebook  symbol  to  Ot . 

Given  the  form  of  the  hidden  Markov  model  defined  in 
(1),  Rabiner  [16]  defines  three  key  problems  of  interest 
that  must  be  solved  for  the  model  to  be  useful  in  real- 
world  applications:  (i)  the  classification  problem;  (ii)  the 
problem  of  finding  an  optimal  state  sequence;  and  (iii)  the 
problem  of  estimating  the  model  parameters. 

The  classification  problem  involves  computing  the 
probability  of  an  observation  sequence  O  =  {Oi, 
O2, . . . ,  Ot}  given  a  model  X,  i.e,  Pr(0\X).  This  probabil¬ 
ity  is  computed  efficiently  using  the  forward-backward 
procedure  [16]. 

In  most  applications,  it  often  turns  out  that  computing 
an  optimal  state  sequence  is  more  useful  than  Pr(0\X). 
There  are  several  possible  optimality  criteria.  One  that 
is  particularly  useful  is  to  maximize  Pr(0,  Q|A)  over  all 
possible  state  sequences  Q.  The  Viterbi  algorithm  [17]  is 
an  efficient  and  formal  technique  for  finding  this  optimal 
state  sequence  and  its  probability. 

The  third  problem  in  building  an  HMM  is  the  train¬ 
ing  problem:  how  does  one  estimate  the  parameters  of  the 
model?  The  problem  is  difficult  because  there  are  several 
levels  of  estimation  required  in  an  HMM.  First,  the  states 
themselves  must  be  estimated.  Then,  the  model  parame¬ 
ters  X  =  (A,  B,  7 r)  need  to  be  estimated.  For  the  discrete 
HMM,  first  the  codebook  is  determined,  usually  using 
the  K-means  algorithm  [18],  or  other  vector  quantization 
algorithms.  Then,  the  parameters  (A,  B,  n)  are  estimated 
iteratively  using  the  Baum-Welch  learning  algorithm  [19]. 

2.2  Baseline  HMM  classifier  for  landmine  detection 

The  baseline  HMM  classifier  for  GPR-based  landmine 
detection  was  first  introduced  in  [5].  It  consists  of  two 
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Fig- 1  Block  diagram  of  the  proposed  eHMM 


HMM  models,  one  for  mines  and  one  for  the  background. 
Each  model  has  four  states  and  produces  a  probability 
value  by  backtracking  through  model  states  using  the 
Viterbi  algorithm  [17].  The  mine  model,  A.m,  is  designed  to 
capture  the  spatial  distribution  of  the  features.  This  model 
assumes  that  mine  signatures  have  a  hyperbolic  shape 
comprised  of  a  succession  of  rising,  horizontal,  and  falling 
edges  with  variable  duration  in  each  state.  The  beginning 
and  the  end  of  the  observation  vectors  correspond  typi¬ 
cally  to  a  non-edge  (or  background)  state.  The  background 
model,  kb,  is  needed  to  capture  the  background  and  clut¬ 
ter  characteristics.  No  prior  information  or  assumptions 
are  used  for  this  model. 

The  architecture  of  the  baseline  HMM  classifier  is 
shown  in  Fig.  2.  Full  details  of  the  models  initialization 
and  training  can  be  found  in  [5].  The  probability  value 
produced  by  the  mine  (background)  model  can  be  thought 
of  as  an  estimate  of  the  probability  of  the  observation 
sequence  given  that  there  is  a  mine  (background)  present. 

The  confidence  value  assigned  to  each  observation 
sequence,  Conf(O),  depends  on:  (1)  the  probability 


assigned  by  the  mine  model,  Pr(0 \km);  (2)  the  probability 
assigned  by  the  background  model,  Pr(0\kc);  and  (3)  the 
optimal  state  sequence.  Thus, 

max  (log  >  o)  if  #{st  =  1,  t  =  1,  •  •  ■  ,  T)  <  Tmax 

0  otherwise 

(2) 

where  #{st  =  1,  t  =  1,  •  •  •  ,  T)  corresponds  to  the  number 
of  observations  assigned  to  the  background  state  (state  1). 
Tmax  is  defined  experimentally  based  on  the  shortest  mine 
signature.  Equation  (2)  ensures  that  sequences  with  a  large 
number  of  observations  assigned  to  state  1  are  considered 
non-mines. 

2.3  Extensions  to  the  baseline  HMM  for  landmine 
detection 

In  an  effort  to  improve  performance  and  generaliza¬ 
tion,  several  extensions  to  the  baseline  HMM  have  been 
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Fig.  2  Architecture  of  the  baseline  HMM  classifier  for  landmine  detection 


proposed.  For  instance,  in  [12,  13],  the  authors  pro¬ 
posed  the  multi-stream  HMM  (MSHMM)  that  com¬ 
bines  multiple  sets  of  features.  An  optimal  weight 
for  each  feature  was  learned  in  the  training  phase. 
In  [14],  maximum  likelihood  (ML)  and  minimization 
of  classification  error  (MCE)  learning  methods  were 
derived  for  the  MSHMM.  In  [15],  HMMs  with  stick¬ 
breaking  priors  (SBHMM)  [20]  were  employed  to  learn 
the  number  of  HMM  states  in  the  baseline  HMM 
landmine  detector.  This  approach  relies  on  a  variational 
Bayesian  learning  technique  in  lieu  of  standard  BW 
training. 

3  Motivations 

The  baseline  HMM  represents  each  class  by  a  single 
model  learned  from  all  the  observations  within  that 
class.  The  goal  is  to  generalize  from  all  the  training 
data  in  order  to  classify  unseen  observations.  However, 
for  complex  classification  problems  with  large  intra-class 
variations,  combining  observations  with  different  charac¬ 
teristics  to  learn  one  model  might  lead  to  too  much  aver¬ 
aging  thus,  lose  the  discriminating  characteristics  of  the 
observations. 

To  illustrate  this  problem,  we  use  the  example  of  detect¬ 
ing  buried  landmines  using  GPR  sensors1.  In  this  case,  the 
training  data  consists  of  a  set  of  N  GPR  alarms  labeled  as 
mines  (class  1)  or  clutter  (class  0).  The  goal  is  to  generalize 
from  the  training  data  in  order  to  classify  unlabeled  GPR 
signatures.  In  Fig.  3,  we  show  three  groups  of  mines  with 
different  signature  strengths.  It  is  obvious  that  grouping 
all  of  these  signatures,  to  learn  a  single  model,  would  lead 
to  poor  generalization.  Similarly,  the  false  alarms  could 
have  significant  variations  as  they  are  caused  by  different 
clutter  objects  and  varied  environment  conditions.  These 
issues  are  more  acute  when  data  are  collected  by  multiple 
sensors  and/or  using  various  features. 

Consequently,  learning  a  set  of  models  that  reflect  dif¬ 
ferent  characteristics  of  the  observations  might  be  more 
beneficial  than  using  one  global  model  for  each  class.  This 


is  typical  in  many  classifiers  such  as  the  K-NN  [9],  which 
uses  different  prototypes  for  each  class,  and  the  SVM  [10], 
which  uses  multiple  support  vectors. 

In  this  paper,  we  develop  a  new  approach  that  replaces 
the  two-model  classifier  with  one  that  includes  multiple 
models  for  each  class.  For  instance,  each  group  of  signa¬ 
tures  in  Fig.  3  would  be  used  to  learn  a  different  model. 
Our  approach  aims  to  capture  the  characteristics  of  the 
observations  that  would  be  lost  under  averaging  in  the 
two-model  case. 

We  hypothesize  that  under  realistic  conditions,  the 
data  are  generated  by  multiple  models.  The  proposed 
approach,  called  ensemble  HMM  (eHMM),  attempts  to 
partition  the  training  data  within  the  log-likelihood  space 
and  identify  multiple  clusters  in  an  unsupervised  man¬ 
ner.  Depending  on  each  clusters  homogeneity  and  size,  an 
appropriate  training  scheme  is  applied  to  learn  the  corre¬ 
sponding  HMM  parameters.  The  resulting  K  HMMs  are 
then  aggregated  through  a  decision  level  fusion  compo¬ 
nent  to  form  a  descriptive  model  for  the  data. 

4  Ensemble  HMM  architecture 

Let  O  =  {Onyr}^=1  he  a  set  of  R  labeled  sequences 

of  length  T  where  Or  =  jo^,  •  ,  O^j  and  yr  e 
{1,  •  •  ■  ,  C}  is  the  label  (class)  of  sequence  Or.  First,  we  need 
to  identify  subgroups  of  observations  that  have  common 
patterns.  Ground  truth  information  could  not  be  used  for 
this  task  as  it  is  insufficient  and  unreliable.  For  instance, 
a  large  deep  buried  mine  can  have  a  signature  similar  to 
a  small  shallow  buried  mine.  Furthermore,  the  same  mine 
buried  at  the  same  depth  in  soil  with  different  proper¬ 
ties  may  have  different  signatures.  Thus,  the  partitioning 
needs  to  be  done  in  an  unsupervised  way,  i.e.,  regardless  of 
the  observation  s  labels  and  the  limited  ground  truth  infor¬ 
mation.  In  our  approach,  we  use  unsupervised  learning  to 
cluster  the  set  of  all  observations,  O,  into  subgroups  of 
“similar”  observations.  The  first  step  in  this  approach  is  to 
define  a  measure  of  similarity  between  two  observations. 
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(a)  Mines  with  strong  signature.  (b)  Mines  with  average  signature. 


(c)  Mines  with  weak  signature. 

Fig.  3  Sample  mine  signatures  manually  categorized  into  three  groups,  a  Mines  with  strong  signature,  b  Mines  with  average  signature,  c  Mines  with 
weak  signature 


4.1  Similarity  between  observations  in  the  log-likelihood 
space 

4.1.1  Fitting  individual  models  to  sequences 

Initially,  each  sequence  in  the  training  data,  Or,  1  <  r  <  R 
is  used  to  learn  an  HMM  model  Ar.  Even  though  using 
only  one  sequence  of  observations  to  learn  an  HMM 
might  lead  to  over-fitting,  this  technique  is  only  an  inter¬ 
mediate  step  that  aims  to  capture  the  characteristics  of 
each  sequence.  The  produced  HMM  model  is  meant  to 
give  a  maximal  description  of  each  sequence,  and  there¬ 
fore,  over-fitting  is  not  an  issue  in  this  context.  In  fact, 
it  is  desired  that  the  model  perfectly  fits  the  observation 
sequence.  In  this  case,  the  likelihood  of  each  sequence 
with  respect  to  its  corresponding  model  is  expected  to  be 
higher  than  those  with  respect  to  the  remaining  models. 

Let  (40)  J  be  the  set  of  initial  models  and  let  s„\  1  < 

n  <  N,  be  the  representative  of  each  state  in  Each 
model  has  N  states.  First,  the  model  states  can  be  assigned 
to  the  sequence  observations  either  heuristically,  using 
domain  knowledge,  or  automatically  by  clustering  the 
sequence  observations  into  N  clusters.  In  our  approach, 
we  use  the  latter  and  we  define  the  states’  means  and 
observations  as  the  center  and  elements  of  each  resulting 
cluster,  respectively.  Consequently,  the  transition  matrix 
and  the  initial  probabilities  of  are  set  according  to  the 
aforementioned  associations.  For  the  emission  probabili¬ 
ties,  the  initialization  differs  whether  we  use  the  discrete 
or  continuous  HMM. 

For  the  discrete  case,  the  codewords  {vi,  •  •  •  vm)  of  the 
initial  individual  DHMM  model  are  the  actual  observa¬ 
tions  of  the  sequence  {Oi,  •  •  •  O^}.  The  emission  probabil¬ 
ity  of  each  codeword  in  each  state  is  inversely  proportional 
to  their  distance  to  the  mean  of  that  state.  We  use 


l 

b„(m)  =  J — ,  1  <  m  <  M,  1  <  n  <  N.  (3) 

£*= 1  l|vm-s;|| 

To  satisfy  the  requirement  that  Ylm= l  =  T  we 

normalize  the  values  using 

b„(m)  < - ,  1  <  m  <  M,  1  <  n  <  N.  (4) 

T!L  bn(D 

In  the  continuous  case,  the  emission  probability  den¬ 
sity  functions  are  modeled  by  mixtures  of  Gaussians.  In 
the  case  of  individual  sequence  models,  as  the  number  of 
observations  is  small,  we  use  a  single  component  mixture 
for  each  state.  Thus,  the  observations  belonging  to  each 
state  are  used  to  estimate  the  mean  and  covariance  of  that 
states  component.  We  use 

/in  =  mean  {Ot\Ot  e  sn} ,  1  <  n  <  N,  (5) 

and 

=  covariance  {Ot\Ot  e  sn) ,  1  <  n  <  N.  (6) 

Then,  the  Baum-Welch  algorithm  [21]  is  used  to  adapt  the 
model  parameters  to  each  given  observation.  Let  {Xr}f=1 
be  the  set  of  trained  individual  models. 

Next,  we  need  to  define  a  measure  that  evaluates  the 
similarity  between  pairs  of  observation  sequences.  While 
similarity  between  static  data  observations  is  straight¬ 
forward  and  well  defined,  defining  a  similarity  between 
observation  sequences  is  more  of  a  challenge.  Within  the 
context  of  HMM  modeling,  we  consider  two  observation 
sequences  similar  if:  (i)  they  fit  each  others  models;  and 
(ii)  they  have  similar  Viterbi  optimal  paths  [17]. 
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4. 7 .2  Log-likelihood-based  similarity 

The  log-likelihood,  L  (/,/),  of  sequence  0/  being  generated 
from  model  Xj  reflects  the  degree  to  which  0/  fits  Xj  and  is 
defined  as: 

L(i,j)  =  \ogPr(Ol\Xj).  (7) 

In  (7),  L  can  be  computed  using  the  forward-backward 
procedure  mentioned  in  Section  2.1.  When  the  log- 
likelihood  value  is  high,  it  is  likely  that  model  Xj  gener¬ 
ated  sequence  O/.  In  this  case,  sequences  0/  and  0/  are 
expected  to  have  common  salient  features  and  are  consid¬ 
ered  to  be  similar.  On  the  other  hand,  when  the  likelihood 
term  is  low,  it  is  unlikely  that  model  Xj  generated  the 
sequence  O/.  In  this  case,  0/  and  0/  are  considered  to  be 
dissimilar.  For  each  observation  sequence  Or,  1  <  r  <  R, 
we  compute  its  likelihood  in  each  model  Xp,  Pr(Or \Xp),  for 
1  <  p  <  R.  This  will  result  in  an  R  x  R  log-likelihood 
matrix. 

4. 7.3  Path-mismatch-based  penalty 

The  likelihood-based  similarity  may  not  be  always  accu¬ 
rate.  In  fact,  some  observations  can  have  high  likelihood 
in  a  visually  different  model.  This  occurs  when  most  of 
the  elements  of  a  sequence  partially  match  only  one  or 
two  of  the  states  of  the  model.  In  this  case,  the  observa¬ 
tion  sequence  can  have  a  high  likelihood  in  the  model  but 
its  optimal  Viterbi  path  will  deviate  from  the  typical  path. 
To  alleviate  this  problem,  we  introduce  a  penalty  term, 
P (/,/),  to  the  log-likelihood  measure  that  is  related  to  the 
mismatch  between  the  most  likely  sequence  of  hidden 
states  of  the  test  sequence  (O/)  and  that  of  the  generating 
sequence  (O/),  i.e., 

P (i,j)  =  DEdit  (QW,  Q^)  ,  1  <  i,j  <  R.  (8) 

In  (8),  P (itf)  is  the  distance  between  the  Viterbi  optimal 
path,  QVl\  of  testing  sequence  0/  with  model  Xj,  and  the 
Viterbi  optimal  path  of  testing  sequence  0/  with  model  Xj, 
Q^\  In  (8),  D£dit  is  the  “edit  distance”  [17]  ,  commonly 
used  in  string  comparisons.  The  “edit  distance”  between 
two  strings,  say  p  and  q,  is  defined  as  the  minimum  num¬ 
ber  of  single-character  edit  operations  (deletions,  inser¬ 
tions,  and/or  replacements)  that  would  convert  p  into  q. 
The  Viterbi  path  mismatch  term  is  intended  to  ensure 
that  similar  sequences  have  few  mismatches  in  their  cor¬ 
responding  Viterbi  optimal  paths.  Since  the  Viterbi  path  is 
already  available  when  using  the  forward-backward  pro¬ 
cedure  for  the  likelihood  computation,  the  penalty  term 
does  not  require  significant  additional  computation. 

Finally,  we  define  the  similarity,  S,  between  two 
sequences  Oj  and  0/  by  combining  (7)  and  (8): 

(9) 

In  (9),  the  mixing  factor,  a  e[  0, 1],  is  a  trade-off  param¬ 
eter  between  the  log-likelihood-based  similarity  and  the 


Viterbi-path-mismatch-based  dissimilarity.  It  is  estimated 
experimentally  by  maximizing  the  intra-class  similarity 
and  minimizing  the  inter-class  similarity  across  the  train¬ 
ing  data.  A  larger  value  of  a  corresponds  to  a  dominant 
log-likelihood-based  similarity  where  the  need  for  the 
penalty  mismatch  is  not  significant.  A  smaller  a  corre¬ 
sponds  to  a  more  significant  path  mismatch  penalty. 

Using  (9)  to  compute  the  similarity  between  all  pairs 
of  observations  results  in  a  similarity  matrix  that  is  not 
symmetric.  Thus,  we  use  the  following  three-step  sym- 
metrization  scheme  to  transform  it  into  a  pairwise  dis¬ 
tance  matrix: 

1.  D (i,j)  =  — S (i,j)  1  <  i,j  <  R 

<  2.  D(i,  i)  =  0,  1  <i<R  (10) 

3.  D (/,/)  =  max(D (/,/),  D(j,  i)),  1  <  i,j  <  R. 

4.2  Pairwise  distance-based  clustering 

The  distance  matrix,  computed  using  (10),  reflects  the 
degree  to  which  pairs  of  sequences  are  considered  similar. 
The  largest  variation  is  expected  to  be  between  sequences 
from  different  classes.  Other  significant  variations  may 
exist  within  the  same  class,  e.g.,  the  groups  of  signatures 
shown  in  Fig.  3.  Our  goal  is  to  identify  the  similar  groups 
so  that  one  model  can  be  learned  for  each  group.  This 
task  can  be  achieved  using  any  relational  clustering  algo¬ 
rithm.  In  our  work,  we  use  the  standard  agglomerative 
hierarchical  algorithm  [18]. 

Agglomerative  hierarchical  clustering  is  a  bottom-up 
approach  that  starts  with  each  data  point  as  a  cluster.  It 
then  proceeds  by  merging  the  most  similar  clusters  to  pro¬ 
duce  a  sequence  of  clusters.  Several  measures  have  been 
used  to  assess  the  similarity  between  clusters  [18].  Exam¬ 
ples  include  single  link,  complete  link,  average  link,  and 
ward  distance.  The  complete  link  method  tends  to  pro¬ 
duce  a  large  number  of  small  and  compact  clusters,  while 
the  single  link  method  is  known  to  result  in  few  “elon¬ 
gated”  clusters  with  large  number  of  points.  A  compro¬ 
mise  between  the  two  is  the  minimum-variance  distance, 
or  ward  distance  [22].  This  distance  is  defined  as 

niVij  „  ,,  o 

d(i,j)  =  - - -  Ci  -  Cj  (11) 

«»  +  «/ 

where  n >  and  are  the  cardinality  and  the  centroid  of 
cluster  C/c,  respectively.  It  has  been  shown  in  [17]  that  this 
approach  merges  the  two  clusters  that  lead  to  the  smallest 
increase  in  the  overall  variance. 

4.3  Ensemble  HMM  initialization  and  training 

The  previous  clustering  step  results  in  K  clusters,  each 
comprised  of  potentially  similar  sequences.  Each  cluster  is 
then  used  to  learn  an  HMM,  resulting  in  an  ensemble  of  K 
HMMs.  Let  A//c  denote  the  number  of  sequences  assigned 
to  the  same  cluster  k .  Since  our  clustering  step  did  not  use 
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class  labels,  clusters  may  include  sequences  from  differ- 
ent  classes.  Let  Nk  J  be  the  number  of  sequences  in  cluster 

k  that  belong  to  class  c,  such  that  J2c= l  =  ^4-  F°r 
instance,  for  the  landmine  example,  if  we  let  c  =  1  denote 
the  class  of  mines  and  c  =  0  denote  the  class  of  clutter, 
would  be  the  number  of  mines  assigned  to  cluster  k. 

The  next  step  of  our  approach  consists  of  learning  a 
set  of  HMMs  that  reflect  the  diversity  of  the  training 
data.  Since  a  cluster  contains  a  set  of  similar  sequences, 
and  each  cluster  may  include  observations  from  different 
classes,  we  learn  one  HMM  model  j  X^  j  for  each  set  of 

(c) 

sequences  assigned  to  class  c  within  cluster  k.  Let  = 

Ar(c) 

1  (c)  ( C )  1 

|  Or  ,yr  }  be  the  set  of  sequences  partitioned  into 

[ 

cluster  k  that  belong  to  class  c  and  let  \xy  \  be  their 
corresponding  individual  HMM  models,  ce  {1,  •  •  •  ,  C}. 

For  each  cluster,  we  devise  one  of  the  following  opti¬ 
mized  training  methods  based  on  the  clusters  size  and 
homogeneity. 

•  Clusters  dominated  by  sequences  from  only  one 
class:  In  this  case,  we  learn  only  one  model  for  this 
cluster.  The  sequences  within  this  cluster  are 
presumably  similar  and  belong  to  the  same  ground 
truth  class,  denoted  Q.  We  assume  that  this  cluster  is 
a  representative  of  that  particular  dominating  class.  It 
is  expected  that  the  class  conditional  posterior 
probability  is  uni-modal  and  peaks  around  the  MLE 
of  the  parameters.  Thus,  a  maximum  likelihood 
estimation  would  result  in  an  HMM  that  best  fits  this 
particular  class.  For  these  reasons,  we  use  the 
standard  Baum-Welch  re-estimation  procedure  [21]. 
Let  I<i  be  the  number  of  homogenous  clusters  that  fit 

into  this  category  and  let  i  =  1,  •  •  •  ,/<i  J 

denote  the  set  of  BW-trained  models. 

•  Clusters  with  a  mixture  of  observations 
belonging  to  different  classes:  In  this  case,  it  is 
expected  that  the  posterior  distribution  of  the 
classes  is  multi-modal.  Thus,  we  need  to  learn  one 
model  for  each  class  represented  in  this  cluster.  The 
MLE  approach  is  not  adequate,  and  more 
discriminative  learning  techniques  such  as  genetic 
algorithms  [23]  or  simulated  annealing  optimization 
[24]  are  needed  to  address  the  multimodality.  In  our 
work,  we  build  a  model  for  each  class  within  the 
cluster.  We  focus  on  finding  the  class  boundaries 
within  the  posteriors  rather  than  trying  to 
approximate  a  joint  posterior  probability.  Thus,  the 
models’  parameters  are  jointly  optimized  to  minimize 
the  overall  misclassification  error  using  a 
discriminative  learning  approach  [25].  Let  I< 2  be  the 
number  of  mixed  clusters  that  fit  into  this  category 


and  let  =  1,  •  •  •  ,  7<T2,  c  =  1,  •  •  •  ,  cj  be  the  set 

of  MCE-trained  models. 

•  Clusters  containing  a  small  number  of  sequences: 

The  MLE  and  MCE  learning  approaches  need  a  large 
number  of  data  points  to  give  robust  estimates  of  the 
model  parameters.  Thus,  when  a  cluster  has  few 
samples,  the  above  approaches  may  not  be  reliable. 
Ignoring  these  clusters  is  not  a  good  option  as  they 
may  contain  information  about  sequences  with 
distinctive  characteristics.  The  Bayesian  training 
framework  [26],  on  the  other  hand,  is  suitable  to  learn 
model  parameters  using  a  small  number  of  training 
sequences.  Specifically,  we  select  only  the  dominating 
class  for  this  cluster  and  learn  a  single  model  using  a 
variational  Bayesian  approach  [26]  to  approximate 
the  class  conditional  posterior  distribution.  Let  K3  be 
the  number  of  small  clusters  that  fit  into  this  category 
and  let  ^  —  h  •  •  •  ,  K3  j  denote  the  set  of 

Bayesian-trained  models. 


To  summarize,  for  each  homogenous  cluster  i,  we  define 

(C) 

one  model  X]  ,  i  =  1,  •  •  •  ,I< 1,  for  the  dominating  class 
Q.  For  mixed  clusters,  we  define  C  models  per  cluster: 
XjC\  c  =  1 ...  C,  j  =  1,  •  •  •  ,7< 2.  For  each  small  cluster, 

we  define  one  model  X^^  for  the  dominating  class  C/c. 
The  ensemble  HMM  mixture  is  defined  as  j  X^  J ,  where 


k  G  {1,  •  •  •  ,7<T},  and  c  =  Q  if  cluster  k  is  dominated 
by  sequences  labeled  with  class  C/c,  and  c  e  {1  •  •  •  ,  C}  if 
cluster  k  is  a  mixed  cluster. 

For  simplicity,  we  assume  that  all  models  Xy  have  a 

(c\ 

fixed  number  of  states  N.  For  each  model  Xy,  the  ini¬ 
tialization  step  consists  of  assigning  the  priors,  the  initial 
states  transition  probabilities,  and  the  states  parameters 
(initial  means  and  initial  emission  probabilities)  using 
observations  Or  and  their  respective  individual  models 
XrC\  r  G  1 1,  In  particular,  the  initial  val¬ 


ues  for  the  priors  and  the  state  transition  probabilities 
are  obtained  by  averaging,  respectively,  the  priors  and 
the  state  transition  probabilities  of  the  individual  models 


Xr°\r  G 


I1’- 


■(c) 


The  initialization  of  the  emission 


probabilities  in  each  state,  b£’c\  depends  on  whether  the 
HMM  is  discrete  or  continuous. 


•  Discrete  HMM  ( DHMM ):  the  state  representatives 

(c) 

and  the  codebook  of  model  Xk  are  obtained  by 
partitioning  and  quantizing  the  observations  O^.  . 
First,  sequences  from  cluster  k  that  belong  to  class  c, 
Or  ,  are  “unrolled”  to  form  a  vector  of  observations 
\j(kc)  0f  iength  Njf  T.  The  state  representatives, 

s^,c\  are  obtained  by  clustering  into  N  clusters 
and  taking  the  centroid  of  each  cluster  as  the  state 
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representative.  Similarly,  the  codebook  \^k,c^  = 

[v<M,  •  •  •  ,  i#*>]  is  obtained  by  clustering  into 

M  clusters.  For  each  symbol  v^,c\  the  membership  in 
each  state  s„<,c^  is  computed  using 


b™(m)=JtV”  ~SY  (12) 

^1=  1  \(k,c)  (*,c)|| 

vm 


To  satisfy  the  requirement  Em=i  b(n’c}  (m)  =  1,  we 
scale  the  values  by: 


b«’c\m) 


E/=l  bn’C)(l) 


(13) 


•  Continuous  HMM  ( CHMM ):  we  assume  that  each 
state  has  Ng  Gaussian  components.  For  each  model 
Xk  ,  as  in  the  discrete  case,  we  define  a  vector  of 
observations,  \J^k,c\  First,  is  partitioned  into  N 
clusters  and  the  center  of  cluster  n  is  taken  as  state 
.  Let  be  the  observations  assigned  to 
cluster  n.  Next,  we  partition  \J{k,c)  into  Ng  clusters 
using  the  k-means  algorithm  [27].  The  mean  of  each 
component,  iin’c,g\  is  the  center  of  one  of  the 
resulting  clusters,  and  the  covariance,  ’  ,  is 

estimated  using  the  observations  that  belong  to  that 
same  cluster.  If  we  denote  by  Un  ’ *  the  observations 
that  belong  to  component  g  of  state  s^’c\  the 

(c) 

parameters  of  Xk  are  computed  using 


{k,c,g) 

fin  =  mean 


{ \Jn'c,g} } ,  1  <  n  <  N,  1  <  g  <  Ng 


(14) 


4.4  Decision  level  fusion 

The  partial  confidence  values  of  the  different  models 
need  to  be  combined  into  a  single  confidence  value.  Let 
A  =  jxf XkB  |  be  the  resulting  mixture  model 
composed  of  a  total  of  I<  models,  I<  =  K\  +  K 2  +  K3. 

Let  F(/c,  r)  =  logPr  (Or|A.^),  1  <  r  <  R,  1  <  k  <  7<T, 
be  the  log-likelihood  matrix  obtained  by  testing  the  R 
training  sequences  with  the  K  models.  Each  column  fr  of 
matrix  F  represents  the  feature  vector  of  each  sequence  in 
the  decision  space  (recall  that  fr  is  a  K-dimensional  vector 
while  Or  is  a  sequence  of  vector  observations  of  length  T ). 
In  other  words,  each  column  represents  the  confidences 
assigned  by  the  K  models  to  each  sequence  r.  Therefore, 
the  set  of  sequences  G  =  {Or,yr}f=1  is  mapped  to  a  con¬ 
fidence  space  {fr,yr}^=1.  Finally,  a  combination  function, 
El,  takes  all  the  f/s  as  input  and  outputs  the  final  deci¬ 
sion.  The  general  framework  for  fusing  the  K  outputs  is 
highlighted  in  Algorithm  1. 


Algorithm  1  Testing  a  new  sequence  using  the  eHMM 
Require:  Test  observation  O 

Ensure: 

l:  Compute  Pr  for  the  K\  clusters  learned  with 

BW 

2:  Compute  Pr  =  maxcPr  for  the 

I< 2  clusters  learned  with  MCE 
3:  Compute  Pr  (0|A.^)  for  the  K3  clusters  learned  with 
VB 

4:  Combine  the  outputs  of  the  multiple  mod- 
els:  Pr(0\A)  =  M(Pr(0\X?w),  Pr  (o\XfCE ), 

M01T8)) 


'Ln’c,g)  =  covariance  J  j ,  1  <  «  <  Ah  1  <  g  <  N„. 

(15) 

For  both  the  discrete  and  continuous  cases,  any  clus¬ 
tering  algorithm,  such  as  the  K-means  [27]  or  the  fuzzy 
c-means  [28],  could  be  used  to  identify  the  states,  code¬ 
book,  or  the  multiple  components.  After  initialization, 
we  use  one  of  the  training  schemes  described  earlier,  to 
update  Xk  parameters  using  the  respective  observations 

G[c\  k  e  {1,  •  •  •  ,/<T},  c  e  {1,  •  •  •  ,  C}.  As  mentioned  ear¬ 
lier,  for  homogenous  clusters,  BW  training  results  in  one 
model  kBW  per  cluster;  for  mixed  clusters,  MCE  training 
results  in  C  models  per  cluster,  A.^C£,  c  =  1 . . .  C;  and  for 
small  clusters,  variational  Bayesian  learning  results  in  one 
model  per  cluster,  XVB.  The  output  of  Baum- Welch-  and 
VB-trained  cluster  models  is  Pr(0 |A.^)  while  the  output  of 
the  MCE-trained  cluster  models  is  maxc  Pr  (o\X^E\ 


Several  decision  level  fusion  techniques  such  as  simple 
algebraic  [29],  artificial  neural  networks  (ANN)  [30],  and 
hierarchical  mixture  of  experts  (HME)  [31]  can  be  used. 
In  our  work,  we  use  an  ANN  with  a  single-layer  percep- 
tron  and  no  hidden  layers.  The  ANN  weights  are  learned 
from  the  labeled  training  data  using  the  backpropagation 
algorithm  [32]. 

The  architecture  of  the  proposed  eHMM  is  summarized 
in  Fig.  1.  It  is  composed  of  four  main  components:  simi¬ 
larity  matrix  computation,  relational  clustering,  adaptive 
training  scheme,  and  decision  level  fusion.  To  test  a  new 
sequence,  the  outputs  of  the  different  models  are  aggre¬ 
gated  into  a  single  confidence  value  using  Algorithm  1. 

5  Application  to  landmine  detection  using 
ground-penetrating  radar  data 

5.1  Data  collections 

The  proposed  eHMM  was  implemented  and  tested  on 
GPR  data  collected  with  a  NIITEK  vehicle  mounted  GPR 
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system  [33]  (see  Fig.  4).  This  system  collects  51  chan¬ 
nels  of  data.  Adjacent  channels  are  spaced  approximately 
five  centimeters  apart  in  the  cross-track  direction,  and 
sequences  (or  scans)  are  taken  at  approximately  1-cm 
down-track  intervals.  Each  A-scan,  that  is,  the  measured 
waveform  that  is  collected  in  one  channel  at  one  down- 
track  position,  contains  416  time  samples  at  which  the 
GPR  signal  return  is  recorded.  We  often  refer  to  the  time 
index  as  depth  although,  since  the  radar  wave  is  travel¬ 
ing  through  different  media,  this  index  does  not  represent 
a  uniform  sampling  of  depth.  Thus,  we  model  an  entire 
collection  of  input  data  as  a  three-dimensional  matrix  of 
sample  values,  S(z,x,y),z  =  1,  •  •  •  ,416;#  =  1,  •  •  •  ,51  ;y  = 
1,  •  •  •  ,  Ns,  where  Ns  is  the  total  number  of  collected  scans, 
and  the  indices  z,  x,  and  y  represent  depth,  cross-track 
position,  and  down-track  positions  respectively.  A  collec¬ 
tion  of  scans,  forming  a  volume  of  data,  is  illustrated  in 
Fig.  5. 

Figure  6  displays  several  B-scans  (sequences  of  A- 
scans)  both  down-track  (formed  from  a  time  sequence 
of  A-scans  from  a  single  sensor  channel)  and  cross-track 
(formed  from  each  channels  response  in  a  single  sample). 
The  surveyed  object  position  is  highlighted  in  each  figure. 
The  objects  scanned  are  (a)  a  high-metal  content  antitank 
mine,  (b)  a  low-metal  antipersonnel  mine,  and  (c)  a  wood 
block. 

Raw  GPR  data  needs  to  be  preprocessed  and  pre¬ 
screened.  Preprocessing  includes  ground-level  alignment 
and  signal  and  noise  background  removal.  Prescreening 
is  needed  to  focus  attention  and  identify  regions  with 
subsurface  anomalies.  For  this  step,  we  use  the  adaptive 
least  mean  squares  (LMS)  prescreener  [34].  The  LMS  flags 
locations  of  interest  utilizing  a  computationally  inexpen¬ 
sive  algorithm  so  that  more  advanced  algorithms  can  be 
applied  only  on  the  small  subsets  of  data  flagged  by  the 
prescreener. 

In  our  experiments,  data  sets  are  comprised  of  a  variety 
of  mine  and  background  signatures.  In  particular,  we  use 


Fig.  5  A  collection  of  few  GPR  scans 


data  collected  from  outdoor  test  lanes  at  three  different 
locations.  The  first  two  locations,  site  1  and  site  2,  were 
temperate  regions  with  significant  rainfall,  whereas  the 
third  collection,  site  3,  was  a  desert  region.  The  lanes  are 
simulated  roads  with  known  mine  locations.  Multiple  data 
collections  were  performed  at  each  site  at  different  dates. 
The  statistics  of  these  data  sets  are  reported  in  Table  1. 
Data  cubes  of  size  (15  scans,  7  channels,  416  depths)  were 
extracted  from  each  scan  position  flagged  by  the  pre¬ 
screener  and  are  presented  to  the  classifier  to  discriminate 
between  mines  and  false  alarms. 

5.2  Feature  extraction 

The  goal  of  the  feature  extraction  step  is  to  transform 
original  GPR  data  into  a  sequence  of  observation  vectors. 
We  use  two  types  of  features  that  have  been  proposed 
and  used  independently.  Each  feature  represents  a  differ¬ 
ent  interpretation  of  the  raw  data  and  aims  at  providing  a 
good  discrimination  between  mine  and  clutter  signatures. 
These  features  are  outlined  in  the  following  subsections. 

5.2. 7  EHD  features 

This  feature  is  based  on  the  edge  histogram  descriptor 
[9]  (EHD)  and  characterizes  edges  in  the  spatial  domain. 


Table  1  Data  collections 


Total  number  of 
prescreened  alarms 

Mine  encounters 

False  alarms 

2477 

732 

1745 

£>z 

1343 

724 

619 

£3 

1843 

613 

1230 
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Fig.  6  NIITEK  radar  down-track  and  cross-track  (at  position  indicated  by  a  line  in  the  down-track)  B-scans  pairs  for  a  an  Anti-Tank  (AT)  mine,  b  an 
Anti-Personnel  (AP)  mine,  and  c  a  non-metal  clutter  alarm 


The  EHD  captures  the  salient  properties  of  the  3D  alarms 
in  a  compact  and  translation  invariant  representation. 
It  extracts  edge  histograms  capturing  the  frequency  of 
occurrence  of  edge  orientations  in  the  data  associated 
with  a  ground  position.  Simple  edge  detector  operators 
are  used  to  identify  edges  and  group  them  into  five  cat¬ 
egories:  vertical,  horizontal,  diagonal,  anti-diagonal,  and 
isotropic  (non-edge).  Each  B-scan  position  is  then  rep¬ 
resented  by  a  five-dimensional  observation  vector.  Each 
dimension  of  this  vector  represents  the  percentage  of 
pixels  (in  a  small  interval  along  the  depth)  that  belong  to 
each  of  the  five  edge  categories. 

5.2.2  Gabor  features 

Gabor  features  characterize  edges  in  the  frequency 
domain  at  multiple  scales  and  orientations  and  are  based 
on  Gabor  wavelets  [7].  This  feature  is  extracted  by  expand¬ 
ing  the  signature  s  B-scan  (depth  vs.  down-track)  using 
a  bank  of  scale  and  orientation  selective  Gabor  filters. 
Expanding  a  signal  using  Gabor  filters  provides  a  local¬ 
ized  frequency  description.  In  our  experiments,  we  use  a 
bank  of  filters  tuned  to  the  combination  of  three  scales  and 
four  orientations.  Each  observation  is  then  represented  by 
a  12-dimension  feature  vector. 

5.3  Ensemble  HMM  implementation  and  results 

In  all  experiments  reported  in  this  paper,  we  use  a  sixfold 
cross  validation  for  each  data  collection  l  e  {1,2,3}. 
For  each  fold,  a  subset  of  the  data  (D[t>«)  is  used  for 
training  and  the  remaining  data  ( ®iTst )  is  used  for  testing. 


O  ijfa  denotes  the  set  of  observation  sequences  extracted 
from  dataset  D  [,  using  one  of  the  feature  extraction  meth¬ 
ods,  “Feat”  (EHD  or  Gabor). 

The  first  step  of  the  eHMM  is  the  similarity  matrix  com¬ 
putation.  This  step  requires  fitting  an  individual  HMM 
model  for  each  sequence  in  the  training  data  Qijff- 
Figure  7  shows  the  log-likelihood  and  path  mismatch 
penalty  matrices  for  a  training  collection  that  has  521 
mines  and  1471  clutter  signatures  (first  training  fold  of 
Di  using  EHD  features,  Dj_f^).  In  these  figures,  the 
indices  are  rearranged  so  that  the  first  entries  correspond 
to  the  mine  signatures  and  the  latter  ones  correspond  to 
non-mine  signatures.  As  it  can  be  seen,  the  matrices  are 
composed  mainly  of  four  blocks.  The  diagonal  blocks  cor¬ 
respond  to  testing  mine  signatures  in  mine  models  and 
non-mine  signatures  in  non-mine  models,  and  the  off- 
diagonal  blocks  correspond  to  testing  mine  signatures  in 
non-mine  models  and  non-mine  signatures  in  mine  mod¬ 
els.  In  these  figures,  dark  pixels  correspond  to  small  values 
of  the  log-likelihood  or  path  mismatch  penalty  and  bright 
pixels  correspond  to  larger  entries  of  the  correspond¬ 
ing  matrices.  Note  that  in  the  case  of  the  log-likelihood 
matrix  in  Fig.  7a,  the  diagonal  blocks  are  brighter  than 
the  off-diagonal  blocks.  This  means  that  the  signatures 
from  the  same  class  are  more  similar  to  each  other  than 
to  signatures  from  different  classes.  Similarly,  in  the  path 
mismatch  penalty  matrix  of  Fig.  7b,  the  diagonal  blocks 
are  darker  than  the  off-diagonal  blocks.  This  means  that 
when  different  mines  are  tested  with  each  other  models, 
the  paths  are  similar.  The  above  observations  are  trivial 
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Mines  False  Alarms 


(a)  Log-likelihood  (b)  path  mismatch  penalty 

Fig.  7  Log-likelihood  and  path  mismatch  penalty  matrices  for  a  large  collection  of  mine  and  clutter  signatures,  a  Log-likelihood  true  and  b  path 
mismatch  penalty 


as  alarms  from  the  same  class  are  expected  to  be  more 
similar  to  each  other  than  alarms  from  different  classes. 
However,  they  could  be  used  to  validate  our  similarity  (and 
penalty)  measures  in  the  log-likelihood  space.  A  more 
important  observation  is  that  within  each  diagonal  block, 
sub-blocks  can  be  extracted.  This  is  an  indication  of  the 
existence  of  different  clusters  within  the  mines  (and  the 
clutter)  themselves. 

In  the  second  step,  the  similarity  matrix  is  transformed 
into  a  distance  matrix,  D,  using  (10).  The  hierarchi¬ 
cal  clustering  algorithm  [18]  is  then  applied,  using  D 
with  a  fixed  number  of  clusters  I<  =  10,  to  identify 
sub-categories  within  the  training  data.  For  both  the 
discrete  and  continuous  versions,  using  any  of  the  fea¬ 
tures  and  datasets,  the  eHMM  clustering  step  success¬ 
fully  assigns  groups  of  similar  alarms  into  clusters.  For 
instance,  in  Fig.  9,  we  show  the  hierarchical  clustering 
results  of  the  first  crossvalidation  fold  of  the  eCHMM 
using  the  EHD  features  on  dataset  Dj,.  As  it  can  be  seen 


in  Fig.  9a,  we  have  a  group  of  clutter  dominated  clus¬ 
ters  (in  brown)  and  a  second  group  of  clusters  dominated 
by  mines  (in  blue).  In  Fig.  8,  we  show  sample  signatures 
that  belong  to  clusters  1,  6,  and  10.  As  it  can  be  seen 
from  Figs.  8  and  9a,  cluster  1  has  only  clutter  and  clus¬ 
ters  6  and  10  are  composed  exclusively  of  mine  alarms. 
The  mines  that  belong  to  cluster  6  have  typically  strong 
mine  signatures.  These  mines,  as  shown  in  Fig.  9b,  c, 
are  typically  mines  with  high  metal  content  that  are  buried 
at  shallow  depths.  The  mines  that  belong  to  cluster  10 
have  weak  GPR  signatures.  These  mines,  as  shown  in 
Fig.  9b,  c,  are  typically  mines  with  weak  signatures  that 
are  either  low  metal  mines  or  mines  buried  at  deep 
depths. 

Additional  details  of  the  clusters’  contents  per  mine  type 
and  per  burial  depth  are  shown  in  Fig.  9b,  c.  To  summa¬ 
rize,  the  training  data  includes  four  homogeneous  clusters 
(Clusters  6,  7,  and  10  contain  only  mines  and  cluster  1  has 
only  clutter).  The  remaining  clusters  (2,  3,  4,  5,  8,  and  9) 
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Distribution  of  the  alarms  in  each  cluster  per  class  {mines  vs.  clutter) 


1  234  5  6789  10 

Distribution  of  the  mines  in  each  cluster  per  type  (Low  Metal  vs.  High  Metal) 


Distribution  of  the  mine  signatures  in  each  cluster  per  burial  depth 

Fig.  9  eCHMM  hierarchical  clustering  results  of  Di™ :  distribution  of  the  alarms  in  each  cluster:  a  per  class,  b  per  type,  c  per  depth 


are  mixed.  Therefore,  using  the  notation  of  Algorithm  1, 
we  define  our  eHMM  as: 

'  yBW  _  \y(F)  AM)  ,  (M)  .  (A*)l 
Ai  ~  jAl  >a6  >A7  >A10  J> 

■  )MCE  =  j xjM), x)C) } e  {2, 3, 4, 5, 8, 9} ,  (16) 

A  =  0- 

In  Table  2,  we  report  the  means  and  the  weights  of 
the  components  of  each  state  of  the  BW-trained  eCHMM 
model  for  cluster  6,  as  well  as  its  transition  proba¬ 
bility  matrix.  Cluster  6  contains  “typical”  mines  that  have 


strong-edge  and  near-perfect  hyperbolic  shape  signatures 
with  succession  of  states  s i,  $2,  and  S3.  Recall  that  si,  S2, 
and  S3  correspond  respectively  to  the  rising  (Dg),  flat  (Hz), 
and  falling  (Ad)  edges  within  the  mine  signature.  There¬ 
fore,  all  the  components  ofsi  (resp.  S3)  have  their  diagonal 
edge  higher  (resp.  lower)  than  the  anti-diagonal  one.  Sim¬ 
ilarly,  components  of  S2  have  higher  horizontal  edge  and 
comparable  diagonal  and  anti-diagonal  edges.  As  it  can  be 
seen  in  the  transition  matrix  of  Table  2,  the  probability  of 
staying  in  si  (resp.  S2)  is  approximately  three  times  (resp. 
two  times)  the  probability  of  moving  to  S2  (resp.  S3). 


Table  2  k%  CHMM  model  parameters  of  cluster  6 


Means 

H 

V 

D 

A 

N 

Weights 

w 

Si 

C11 

0.21 

0.17 

0.41 

0.07 

0.13 

9n 

0.30 

Cl  2 

0.36 

0.12 

0.25 

0.11 

0.17 

9n 

0.22 

Cl  3 

0.15 

0.18 

0.23 

0.06 

0.37 

013 

0.49 

52 

C21 

0.42 

0.09 

0.25 

0.12 

0.13 

gn 

0.30 

C22 

0.37 

0.10 

0.10 

0.30 

0.13 

922 

0.20 

C23 

0.59 

0.05 

0.10 

0.12 

0.14 

923 

0.50 

S3 

C31 

0.38 

0.11 

0.10 

0.26 

0.16 

93 1 

0.21 

C32 

0.20 

0.17 

0.07 

0.43 

0.13 

932 

0.30 

C33 

0.14 

0.20 

0.06 

0.25 

0.36 

933 

0.49 

A 

Si 

52 

S3 

Si 

0.73 

0.27 

0.00 

S2 

0.00 

0.67 

0.33 

S3 

0.00 

0.00 

1.00 

Table  3  CHMM  model  parameters  of  cluster  10 


Means 

H 

V 

D 

A 

N 

Weights 

w 

Si 

C11 

0.14 

0.13 

0.17 

0.08 

0.48 

9n 

0.27 

Cl  2 

0.26 

0.11 

0.20 

0.06 

0.37 

9n 

0.40 

Cl  3 

0.16 

0.04 

0.10 

0.05 

0.66 

913 

0.32 

S2 

C21 

0.30 

0.07 

0.10 

0.14 

0.39 

921 

0.50 

C22 

0.48 

0.05 

0.11 

0.14 

0.22 

922 

0.28 

C23 

0.27 

0.12 

0.07 

0.37 

0.16 

923 

0.21 

S3 

C31 

0.09 

0.11 

0.03 

0.18 

0.59 

931 

0.60 

C32 

0.22 

0.17 

0.05 

0.36 

0.20 

932 

0.04 

C33 

0.10 

0.20 

0.02 

0.31 

0.36 

933 

0.36 

A 

Si 

S2 

S3 

Si 

0.74 

0.26 

0.00 

S2 

0.00 

0.89 

0.11 

S3 

0.00 

0.00 

1.00 
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Fig.  10  Scatter  plot  of  the  confidences  of  the  training  data  in  cluster  model  k^  (strong  mines)  vs.  cluster  model  k\^  (weak  mines) 


Table  3  shows  the  BW-trained  eCHMM  model  for  clus¬ 
ter  10,  X^\  Recall  that  cluster  10  contains  only  mine 
signatures  that  have  a  low  metal  content  and/or  are  buried 
at  4"  or  deeper,  as  it  can  be  seen  in  Fig.  9c.  Therefore, 
the  alarms  in  cluster  10  are  expected  to  have  weak  sig¬ 
natures  and  weak  edge  features.  This  could  explain  the 
large  non-edge  component  of  most  of  the  states’  means 
components  of  X^  reported  in  Table  3,  compared  to 
the  non-edge  components  of  X^’s  states  representatives. 
Nevertheless,  the  states  representatives  still  characterize 
the  hyperbolic  shape  of  a  typical  mine  signature,  i.e., 
the  succession  of  Dg  —  Hz  —  Ad  states.  For  instance,  all 
si  components  means  have  their  diagonal  D  dimension 
larger  than  the  anti-diagonal  A  dimension.  For  the  tran¬ 
sition  matrix,  we  notice  that  XyP  is  more  stationary  in 
S2,  with  a  probability  of  0.89,  compared  to  X^\  This 
means  that,  on  average,  sequences  belonging  to  cluster  10 
have  a  large  number  of  observations  with  flat  edge  and 


fewer  observations  with  strong  diagonal  or  anti-diagonal 
edges. 

Figure  10  shows  the  scatter  plot  of  the  confidences 
assigned  by  X ^  and  Xy^  to  all  the  training  data.  In 
this  figure,  we  display  clutter  and  mine  signatures  that 
belong  to  each  cluster  using  different  symbols  and  colors. 
Even  though  the  two  models  are  dominated  by  mine  sig¬ 
natures,  we  see  that  not  all  confidence  values  are  highly 
correlated.  On  one  hand,  some  strong  mine  signatures, 
particularly  those  belonging  to  cluster  6,  have  high  log- 
likelihoods  in  model  X^  and  lower  log-likelihoods  in 
model  XyP  (lower  right  side  of  the  scatter  plot,  region  Rl). 
This  can  be  attributed  to  the  fact  that  cluster  6  contains 
mainly  strong  mines  and  is  more  likely  to  yield  high  log- 
likelihood  when  testing  a  strong  mine  signature.  On  the 
other  hand,  in  region  R2,  the  performance  of  Xy^  is  better 
as  it  gives  higher  likelihood  values  to  the  “weak”  mines 
in  that  region,  particularly  those  belonging  to  cluster  10. 
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In  fact,  this  result  is  expected  because  cluster  10  contains 
weak  mine  signatures. 

The  main  conclusion  that  we  can  draw  from  the  above 
example  is  that  and  k^  are  very  different  and  char¬ 
acterize  two  distinct  subsets  of  the  training  data.  The 
standard  HMM  approach  would  combine  all  alarms  to 
learn  a  single  model  for  mines  (weak  and  strong)  and  a 
single  model  for  clutter. 

In  the  final  step,  the  eHMM  mixture  is  combined  using 
a  single-layer  ANN.  The  ANN  parameters  are  trained  to 
fit  the  responses  of  the  eHMM  mixture  models  to  the 
training  data  labels. 

5.4  eHMM  vs.  baseline  HMM  results 

In  this  section,  we  compare  the  performance  of  the  pro¬ 
posed  eHMM  to  the  baseline  HMM  [5].  For  the  eHMM, 
we  show  the  results  using  the  ANN  fusion  and  the 
hierarchical  agglomerative  clustering  with  I<  =  20.  In 
Fig.  11a,  b,  we  show  the  ROCs  generated  by  the  dis¬ 
crete  versions  of  the  eHMM  and  the  baseline  HMM  on 
dataset  T)x,  using  EHD  and  Gabor  features.  Similarly,  in 
Fig.  12a,  b,  we  report  the  ROCs  generated  by  the  contin¬ 
uous  versions,  i.e.,  the  eCHMM  and  the  baseline  CHMM, 
on  dataset  Dj_.  As  it  can  be  seen,  in  all  the  ROCs  of  Figs.  11 
and  12,  at  a  given  false  alarm  rate  (FAR),  the  eHMM 
has  a  better  probability  of  detecting  targets.  For  instance, 
in  Fig.  11a,  at  a  FAR  of  10  %,  the  eDHMM  using  EHD 
features  successfully  identifies  94  %  of  the  mines  while 
the  baseline  DHMM  identifies  only  87  %  of  the  targets. 
At  the  same  FAR  of  10  %,  the  ROCs  of  Fig.  12a  show 
that  the  eCHMM  successfully  identifies  95  %  of  the  tar¬ 
gets  while  the  baseline  CHMM  probability  of  detection 
is  85  %. 

The  results  for  all  three  datasets  are  summarized 
in  terms  of  the  Area  Under  ROC  Curve  (AUC)  and 
are  reported  in  Table  4.  As  it  can  be  seen,  in  all 
experiments,  the  eHMM  outperforms  the  baseline 
HMM. 


6  Conclusions 

In  this  work,  we  have  proposed  a  novel  ensemble 
HMM  classification  method  that  is  based  on  clustering 
sequences  in  the  log-likelihood  space.  The  eHMM  uses 
multiple  HMM  models  and  fuses  them  for  final  deci¬ 
sion  making.  We  hypothesized  that  the  data  are  generated 
by  multiple  models.  These  different  models  reflect  the 
fact  that  samples  from  the  same  class  can  have  different 
characteristics  resulting  in  large  intra-class  variability. 

The  eHMM,  in  its  discrete  and  continuous  versions, 
was  implemented  and  evaluated  using  large  collections  of 
landmine  GPR  data.  We  examined  the  intermediate  steps 
of  the  eHMM  and  compared  its  performance  to  the  base¬ 
line  HMM.  Results  on  three  GPR  data  collections  show 
that  the  proposed  method  can  identify  meaningful  and 
coherent  HMM  mixture  models  that  describe  different 
properties  of  the  data.  Each  individual  HMM  character¬ 
izes  a  group  of  data  that  share  common  attributes.  The 
experiments  show  that  the  proposed  eHMM  intermediate 
results  are  inline  with  the  expected  behavior.  The  results 


Table  4  AUC  of  the  ensemble  HMM  and  baseline  HMM  classifiers 


Dataset 

Classifier  using: 

EHD 

Gabor 

2>X 

Ensemble  DHMM 

712 

719 

Baseline  DHMM 

643 

499 

Ensemble  CHMM 

718 

617 

Baseline  CHMM 

614 

472 

Ensemble  DHMM 

402 

127 

Baseline  DHMM 

107 

30 

Ensemble  CHMM 

359 

122 

Baseline  CHMM 

209 

102 

Ensemble  DHMM 

343 

296 

Baseline  DHMM 

272 

122 

Ensemble  CHMM 

326 

226 

Baseline  CHMM 

284 

140 
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also  indicate  that,  for  both  the  continuous  and  discrete 
versions,  the  proposed  method  outperforms  the  baseline 
HMM  that  uses  one  model  for  each  class  in  the  data. 

Endnote 

1  The  details  of  the  landmine  detection  application 
using  GPR  signatures  will  be  presented  in  section  5. 
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